---
subtitle: >-
  A high-level overview on how to use Opik's evaluation features including some
  code snippets
---

Evaluation in Opik helps you assess and measure the quality of your LLM outputs across different dimensions.
It provides a framework to systematically test your prompts and models against datasets, using various metrics
to measure performance.

![Opik Evaluation](/img/evaluation/evaluation_overview.gif)

Opik also provides a set of pre-built metrics for common evaluation tasks. These metrics are designed to help you
quickly and effectively gauge the performance of your LLM outputs and include metrics such as Hallucination,
Answer Relevance, Context Precision/Recall and more. You can learn more about the available metrics in the
[Metrics Overview](/evaluation/metrics/overview) section.

<Tip>
  If you are interested in evaluating your LLM application in production, please
  refer to the [Online evaluation guide](/production/rules). Online evaluation
  rules allow you to define LLM as a Judge metrics that will automatically score
  all, or a subset, of your production traces.
</Tip>

<Note>
  **New: Multi-Value Feedback Scores** - Opik now supports collaborative
  evaluation where multiple team members can score the same traces and spans.
  This reduces bias and provides more reliable evaluation results through
  automatic score aggregation. [Learn more
  →](/tracing/annotate_traces#multi-value-feedback-scores)
</Note>

## Running an Evaluation

Each evaluation is defined by a dataset, an evaluation task and a set of evaluation metrics:

1. **Dataset**: A dataset is a collection of samples that represent the inputs and, optionally, expected outputs for
   your LLM application.
2. **Evaluation task**: This maps the inputs stored in the dataset to the output you would like to score. The evaluation
   task is typically a prompt template or the LLM application you are building.
3. **Metrics**: The metrics you would like to use when scoring the outputs of your LLM

To simplify the evaluation process, Opik provides two main evaluation methods: `evaluate_prompt` for evaluation prompt
templates and a more general `evaluate` method for more complex evaluation scenarios.

<Tip>
  **TypeScript SDK Support** This document covers evaluation using Python, but
  we also offer full support for TypeScript via our dedicated TypeScript SDK.
  See the [TypeScript SDK Evaluation
  documentation](/reference/typescript-sdk/evaluation/overview) for
  implementation details and examples.
</Tip>

<Tabs>
    <Tab value="Evaluating Prompts" title="Evaluating Prompts">

To evaluate a specific prompt against a dataset:

```python
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("Evaluation test dataset")
dataset.insert([
    {"input": "Hello, world!", "expected_output": "Hello, world!"},
    {"input": "What is the capital of France?", "expected_output": "Paris"},
])

# Run the evaluation
result = evaluate_prompt(
    dataset=dataset,
    messages=[{"role": "user", "content": "Translate the following text to French: {{input}}"}],
    model="gpt-3.5-turbo",  # or your preferred model
    scoring_metrics=[Hallucination()]
)
```

</Tab>
<Tab value="Evaluating RAG applications and Agents" title="Evaluating RAG applications and Agents">

For more complex evaluation scenarios where you need custom processing:

```python
import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import ContextPrecision, ContextRecall

# Create a dataset with questions and their contexts
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("RAG evaluation dataset")
dataset.insert([
    {
        "input": "What are the key features of Python?",
        "context": "Python is known for its simplicity and readability. Key features include dynamic typing, automatic memory management, and an extensive standard library.",
        "expected_output": "Python's key features include dynamic typing, automatic memory management, and an extensive standard library."
    },
    {
        "input": "How does garbage collection work in Python?",
        "context": "Python uses reference counting and a cyclic garbage collector. When an object's reference count drops to zero, it is deallocated.",
        "expected_output": "Python uses reference counting for garbage collection. Objects are deallocated when their reference count reaches zero."
    }
])

def rag_task(item):
    # Simulate RAG pipeline
    output = "<LLM response placeholder>"

    return {
        "output": output
    }

# Run the evaluation
result = evaluate(
    dataset=dataset,
    task=rag_task,
    scoring_metrics=[
        ContextPrecision(),
        ContextRecall()
    ],
    experiment_name="rag_evaluation"
)
```

</Tab>
<Tab value="Using the Playground" title="Using the Playground">
    You can also use the Opik Playground to quickly evaluate different prompts and LLM models.

    To use the Playground, you will need to navigate to the [Playground](/prompt_engineering/playground) page and:

    1. Configure the LLM provider you want to use
    2. Enter the prompts you want to evaluate - You should include variables in the prompts using the `{{variable}}` syntax
    3. Select the dataset you want to evaluate on
    4. Click on the `Evaluate` button

    You will now be able to view the LLM outputs for each sample in the dataset:

    ![Playground](/img/evaluation/playground_evaluation.gif)

</Tab>
</Tabs>

## Analyzing Evaluation Results

Once the evaluation is complete, Opik allows you to manually review the results and compare them with previous iterations.

<Frame>
  <img src="/img/evaluation/experiment_items.png" />
</Frame>

In the experiment pages, you will be able to:

1. Review the output provided by the LLM for each sample in the dataset
2. Deep dive into each sample by clicking on the `item ID`
3. Review the experiment configuration to know how the experiment was Run
4. Compare multiple experiments side by side


### Analyzing Evaluation Results in Python

To analyze the evaluation results in Python, you can use the `EvaluationResult.aggregate_evaluation_scores()` method
to retrieve the aggregated score statistics:

```python
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination

# Create a dataset that contains the samples you want to evaluate
opik_client = opik.Opik()
dataset = opik_client.get_or_create_dataset("Evaluation test dataset")
dataset.insert([
    {"input": "Hello, world!", "expected_output": "Hello, world!"},
    {"input": "What is the capital of France?", "expected_output": "Paris"},
])
# Run the evaluation
result = evaluate_prompt(
    dataset=dataset,
    messages=[{"role": "user", "content": "Translate the following text to French: {{input}}"}],
    model="gpt-5",  # or model that you want to evaluate
    scoring_metrics=[Hallucination()],
    verbose=2,  # print detailed statistics report for every metric
)

# Retrieve and print the aggregated scores statistics (mean, min, max, std) per metric
scores = result.aggregate_evaluation_scores()
for metric_name, statistics in scores.aggregated_scores.items():
    print(f"{metric_name}: {statistics}")
```

You can use aggregated scores to compare the performance of different models or different versions of the same model.

### Experiment-Level Metrics

In addition to per-item metrics, you can compute experiment-level aggregate metrics that are calculated across all test results. These experiment scores are displayed in the Opik UI alongside feedback scores and can be used for sorting and filtering experiments.

<Tip>
  Learn more about computing experiment-level metrics in the [Evaluate your LLM application](/evaluation/evaluate_your_llm#computing-experiment-level-metrics) guide.
</Tip>

## Learn more

You can learn more about Opik's evaluation features in:

1. [Evaluation concepts](/evaluation/concepts)
1. [Evaluate prompts](/evaluation/evaluate_prompt)
1. [Evaluate threads](/evaluation/evaluate_threads)
1. [Evaluate complex LLM applications](/evaluation/evaluate_your_llm)
1. [Evaluation metrics](/evaluation/metrics/overview)
1. [Manage datasets](/evaluation/manage_datasets)
